
feat: Dubai villa lead scraper + Apify bridge + Sheets sync #1

Open
vishnu-madhavan-git wants to merge 1 commit into main from feature/dubai-villa-scraper

Conversation

@vishnu-madhavan-git
Collaborator

@vishnu-madhavan-git vishnu-madhavan-git commented Mar 8, 2026

Summary

Adds automated Dubai villa owner lead collection for IXR interior design client acquisition.

New files

  • agents/dubai_villa_scraper.py — stdlib-only scraper (PropertyFinder + Bayut)
  • agents/apify_dubai_scraper.py — Apify-powered scraper (faster, handles anti-bot)
  • core/leads-bridge.js — syncs leads to Google Sheets CRM

Usage

```bash
# Basic (no API key needed)
python agents/dubai_villa_scraper.py --area "Palm Jumeirah" --max 30

# Apify (faster, needs APIFY_TOKEN in .env)
python agents/apify_dubai_scraper.py --max 100
```

Sheets sync

```js
const { syncLeadsToSheets } = require('./core/leads-bridge');
await syncLeadsToSheets();
```


Summary by CodeRabbit

  • New Features

    • Dubai villa lead scraper: Collect leads from PropertyFinder and Bayut without API key requirements.
    • Apify scraper integration: Enhanced scraping via Apify's Dubai Real Estate actor (requires APIFY_TOKEN).
    • Google Sheets sync: Automatically sync scraped leads to Google Sheets, deduplicating by contact information.
  • Documentation

    • Updated README with villa scraper setup, usage examples, and Sheets integration instructions.

- agents/dubai_villa_scraper.py: stdlib-only scraper for PropertyFinder + Bayut
  - Extracts UAE phone numbers, owner names, areas, prices
  - Deduplicates by phone number
  - Outputs JSON to data/state/villa_leads.json

- agents/apify_dubai_scraper.py: Apify actor bridge (faster path)
  - Uses redoubtable_bubble/dubai-real-estate-scraper actor
  - Handles anti-bot automatically
  - Requires APIFY_TOKEN in .env

- core/leads-bridge.js: syncs villa_leads.json to Google Sheets CRM
  - Deduplicates synced leads
  - Uses existing SheetsService pattern

Use case: IXR interior design client acquisition from Dubai villa owners
Copilot AI review requested due to automatic review settings March 8, 2026 02:37
@coderabbitai

coderabbitai Bot commented Mar 8, 2026

📝 Walkthrough

A Dubai villa lead scraping system is added with multiple collection options: PropertyFinder/Bayut (no API key required) and Apify-based scraping. Includes deduplication logic, JSON persistence at data/state/villa_leads.json, and a Google Sheets sync bridge via core/leads-bridge.js. Documentation and dependencies updated accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Configuration<br>`README.md`, `agents/requirements.txt` | Added Dubai Villa Lead Scraper section with usage examples, environment setup, and clarified dependency notes (stdlib-only base, optional requests/beautifulsoup4). |
| Villa Scraping Agents<br>`agents/dubai_villa_scraper.py`, `agents/apify_dubai_scraper.py` | Implemented two scraping strategies: PropertyFinder/Bayut direct web scraping with retry/backoff, and Apify actor integration. Both include lead normalization, phone extraction, deduplication against existing leads, and JSON persistence. |
| Lead Sync Integration<br>`core/leads-bridge.js` | New Leads Bridge module that filters synced leads by phone, initializes Sheets integration, and asynchronously syncs new villa leads with area, price, source, and URL as notes. |

Sequence Diagrams

```mermaid
sequenceDiagram
    participant DLS as Dubai Villa Scraper
    participant PF as PropertyFinder/Bayut
    participant Store as JSON Store
    participant Dedup as Deduplicator

    DLS->>PF: Fetch listings (area, pagination)
    PF-->>DLS: HTML response
    DLS->>DLS: Parse & extract leads<br/>(phone, price, URL, area)
    DLS->>Store: Load existing leads
    Store-->>DLS: villa_leads.json
    DLS->>Dedup: Check for duplicates<br/>(by phone number)
    Dedup-->>DLS: New leads, dup count
    DLS->>Store: Append & save updated list
    Store-->>DLS: Persisted
    DLS->>DLS: Output summary JSON<br/>(new, total, duplicates)
```
```mermaid
sequenceDiagram
    participant Bridge as Leads Bridge
    participant VL as villa_leads.json
    participant SL as synced_leads.json
    participant Sheets as Google Sheets

    Bridge->>VL: Read all villa leads
    VL-->>Bridge: Lead array
    Bridge->>SL: Load synced phones
    SL-->>Bridge: Phone list
    Bridge->>Bridge: Filter new leads<br/>(exclude synced)
    Bridge->>Sheets: Initialize API
    loop For each new lead
        Bridge->>Sheets: Sync lead<br/>(name, area, price, URL)
        Sheets-->>Bridge: Success/Failure
    end
    Bridge->>SL: Append synced phones
    SL-->>Bridge: Updated
    Bridge->>Bridge: Return sync summary
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 In Dubai's desert, where villas bloom,
I scrape the leads from room to room,
PropertyFinder, Bayut, or Apify's call,
Dedup them quick, sync to Sheets—I've got 'em all!
Phone numbers dance, coordinates gleam,
A villa lead symphony, coded in dream! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changes: adding a Dubai villa lead scraper, Apify integration, and Google Sheets sync functionality. |


@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 4 potential issues.

View 6 additional findings in Devin Review.


Comment thread core/leads-bridge.js
```js
}

async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
```


🔴 new (require('./sheets'))() fails because sheets.js exports a singleton instance, not a class

core/sheets.js:74 exports module.exports = new SheetsService() — a singleton instance, not the SheetsService class. In core/leads-bridge.js:30, new (require('./sheets'))() attempts to call new on that instance, which will throw TypeError: (intermediate value) is not a constructor at runtime. The function syncLeadsToSheets() can never succeed.

Suggested change

```diff
- const sheetsService = new (require("./sheets"))();
+ const sheetsService = require("./sheets");
```

Comment thread core/leads-bridge.js
Comment on lines +57 to +59
```js
// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```


🔴 Failed-to-sync leads are incorrectly marked as synced

At core/leads-bridge.js:58, all unsynced leads are unconditionally added to the synced list, even if sheetsService.addLead() threw an error for some of them (caught at line 52). This means leads that failed to sync to Google Sheets will never be retried on subsequent runs, since they'll already be in synced_leads.json. The code should only mark successfully synced leads.

Prompt for agents
In core/leads-bridge.js, lines 43-59: Track which leads were successfully synced in an array (e.g., `successfullySynced`), push each lead to it inside the try block after the synced++ increment (around line 51), and then on line 58, use that array instead of `unsynced` when building `newSynced`. This ensures only leads that were actually synced to Google Sheets get marked in synced_leads.json.

Comment on lines +107 to +112
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
    log(f"Fetched {len(items)} items from Apify")
    return items
```


🔴 Apify dataset items API returns a JSON array, not a {data: {items: []}} object

The Apify API endpoint /v2/actor-runs/{runId}/dataset/items returns a raw JSON array of items, not an object with a data.items wrapper. At agents/apify_dubai_scraper.py:109-110, apify_request parses the response with json.loads() which yields a Python list. Then result.get("data", {}).get("items", []) will raise AttributeError: 'list' object has no attribute 'get' because lists don't have .get(). This means fetch_results() will always crash after a successful actor run.

Suggested change

```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
     result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    if isinstance(result, list):
+        items = result
+    else:
+        items = result.get("data", {}).get("items", [])
     log(f"Fetched {len(items)} items from Apify")
     return items
```


Comment on lines +219 to +220
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
```


🔴 Bayut URL construction double-encodes area: urllib.parse.quote() then .replace(' ', '-') is a no-op

At agents/dubai_villa_scraper.py:219-220, when area is provided, encoded_area = urllib.parse.quote(area) converts spaces to %20 (e.g., "Palm Jumeirah" → "Palm%20Jumeirah"). The subsequent .lower().replace(' ', '-') on line 220 is then a no-op since there are no literal spaces left, producing a URL like .../villa/palm%20jumeirah/ instead of the expected .../villa/palm-jumeirah/. This will likely return a 404 or wrong results from Bayut.

Suggested change

```diff
- encoded_area = urllib.parse.quote(area) if area else "dubai"
- url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
+ area_slug = area.lower().replace(" ", "-") if area else "dubai"
+ url = f"https://www.bayut.com/for-rent/villa/{area_slug}/?owner_only=1"
```



Copilot AI left a comment


Pull request overview

Adds an automated lead-generation pipeline for Dubai villa listings, including two Python-based scrapers (stdlib + Apify) and a Node bridge that syncs collected leads into the existing Google Sheets CRM integration.

Changes:

  • Add agents/dubai_villa_scraper.py to scrape PropertyFinder/Bayut and persist leads to data/state/villa_leads.json.
  • Add agents/apify_dubai_scraper.py to pull owner-contact leads via an Apify actor and persist them in the same state file.
  • Add core/leads-bridge.js plus README updates to sync stored leads into Google Sheets.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
core/leads-bridge.js New bridge that loads leads from data/state and pushes unsynced ones to Google Sheets.
agents/requirements.txt Updates comments describing scraper dependency expectations.
agents/dubai_villa_scraper.py New stdlib-only scraper that collects and deduplicates leads into state JSON.
agents/apify_dubai_scraper.py New Apify API bridge that runs an actor, normalizes results, and writes leads to state JSON.
README.md Documents how to run both scrapers and how to sync leads to Sheets.


Comment thread core/leads-bridge.js
```js
}

async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
```

Copilot AI Mar 8, 2026


core/sheets.js exports an already-instantiated SheetsService (module.exports = new SheetsService()), but this code treats it like a constructor (new (require("./sheets"))()). That will throw at runtime. Use the exported instance directly and call init()/addLead() on it.

Suggested change

```diff
- const sheetsService = new (require("./sheets"))();
+ const sheetsService = sheets;
```

Copilot uses AI. Check for mistakes.
Comment thread core/leads-bridge.js
Comment on lines +57 to +59
```js
// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```

Copilot AI Mar 8, 2026


This marks all unsynced leads as synced even if some addLead calls fail (errors are caught but the lead is still appended to synced_leads.json). This will prevent retries and can permanently drop leads from being synced. Track only successfully synced leads (or update the synced file incrementally on each success) before writing synced_leads.json.

Comment on lines +219 to +220
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
```

Copilot AI Mar 8, 2026


Bayut URL construction mixes quote() with a slug-style path. For areas like "Palm Jumeirah", urllib.parse.quote(area) yields Palm%20Jumeirah, so the URL becomes /villa/palm%20jumeirah/… rather than the expected hyphenated slug. Build the slug first (lowercase + replace spaces with '-') and then quote (or avoid quoting if the slug is URL-safe) to ensure the listings page resolves correctly.

Suggested change

```diff
- encoded_area = urllib.parse.quote(area) if area else "dubai"
- url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
+ if area:
+     # Build a Bayut-style slug: lowercase and hyphen-separated, then URL-encode if needed.
+     area_slug = area.strip().lower().replace(" ", "-")
+     encoded_area = urllib.parse.quote(area_slug, safe="-")
+ else:
+     encoded_area = "dubai"
+ url = f"https://www.bayut.com/for-rent/villa/{encoded_area}/?owner_only=1"
```

Comment thread core/leads-bridge.js
Comment on lines +7 to +8
```js
const sheets = require("./sheets");
```

Copilot AI Mar 8, 2026


const sheets = require("./sheets"); is unused (and syncLeadsToSheets requires the module again later). After fixing the SheetsService instantiation, keep a single import and use it to avoid duplication and dead code.

Comment on lines +115 to +117
```python
area_slug = area.lower().replace(" ", "-") if area else ""
base_url = "https://www.propertyfinder.ae/en/search?c=2&t=1&fu=1&rp=y"
if area_slug:
```

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

area_slug is computed but never used. Remove it or use it consistently when building the query/location parameter to keep the scraping logic clear.

Suggested change

```diff
- area_slug = area.lower().replace(" ", "-") if area else ""
  base_url = "https://www.propertyfinder.ae/en/search?c=2&t=1&fu=1&rp=y"
- if area_slug:
+ if area:
```
Comment on lines +135 to +140
```python
# Simpler extraction - look for villa data in JSON-LD or meta tags
# PropertyFinder embeds listing data as JSON
json_matches = re.findall(r'window\.__INITIAL_STATE__\s*=\s*({.*?});\s*</script>', html, re.DOTALL)
if not json_matches:
    json_matches = re.findall(r'"properties"\s*:\s*(\[.*?\])', html, re.DOTALL)
```

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_matches is computed but never used. If the intent is to parse window.__INITIAL_STATE__ for structured listing data (more robust than regex), consider implementing that; otherwise remove this block to reduce dead code.

Suggested change

```diff
- # Simpler extraction - look for villa data in JSON-LD or meta tags
- # PropertyFinder embeds listing data as JSON
- json_matches = re.findall(r'window\.__INITIAL_STATE__\s*=\s*({.*?});\s*</script>', html, re.DOTALL)
- if not json_matches:
-     json_matches = re.findall(r'"properties"\s*:\s*(\[.*?\])', html, re.DOTALL)
```
Comment on lines +62 to +66
```python
def fetch_url(url: str, headers: dict = None) -> str | None:
    """Fetch a URL with retry logic."""
    default_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
```

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fetch_url uses the PEP 604 union type (str | None), which requires Python 3.10+. The repo README currently targets “system python” without a minimum version, and other agents avoid this syntax. Consider switching to Optional[str] (and similar) or explicitly documenting/enforcing Python >= 3.10 for agents.

Comment on lines +103 to +107
```python
for p in phones:
    clean = re.sub(r'[\s\-]', '', p)
    if clean not in normalized:
        normalized.append(clean)
return normalized
```

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deduplicate() keys off lead["phone"], but normalization here only strips spaces/dashes. The same UAE number can appear as 00971..., +971..., or local 05... and won’t dedupe correctly. Normalize to a single canonical format (e.g. E.164 +971...) before adding to normalized.

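The canonicalization the reviewer asks for can be sketched as follows. This is a hypothetical `canonical_uae_phone` helper (not code from this PR) that collapses the `00971...`, `+971...`, and local `05...` spellings into a single E.164-style dedup key:

```python
import re

def canonical_uae_phone(raw: str) -> str:
    """Normalize a UAE phone string to an E.164-style +971... dedup key."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parens, leading '+'
    if digits.startswith("00971"):
        digits = digits[2:]          # international 00971... -> 971...
    elif digits.startswith("05"):
        digits = "971" + digits[1:]  # local 05x... -> 9715x...
    return "+" + digits

# The same number in three spellings collapses to one key:
forms = ["00971 50 123 4567", "+971-50-123-4567", "050 123 4567"]
assert len({canonical_uae_phone(f) for f in forms}) == 1
```

Keying `deduplicate()` on this canonical form instead of the raw string would make the phone-based dedup robust to formatting differences.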
Comment on lines +65 to +66
```python
def run_actor(token: str, area: str, max_items: int, property_type: str = "villa") -> str | None:
    """Start the Apify actor run and return run ID."""
```

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file also uses PEP 604 union types in return annotations (e.g. str | None, dict | None), which require Python 3.10+. If agents are intended to run on an unspecified “system python”, consider using Optional[...]/Union[...] or documenting/enforcing Python >= 3.10.

```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
```

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fetch_results assumes apify_request returns an object with data.items, but the dataset items endpoint commonly returns a raw JSON array. If a list is returned, result.get(...) will throw and the scraper will crash. Adjust apify_request/fetch_results to handle a list response (or request a response shape that’s always an object).

Suggested change

```diff
- items = result.get("data", {}).get("items", [])
+ # The Apify dataset items endpoint may return either:
+ # - a raw JSON array of items, or
+ # - an object that wraps items under data.items.
+ if isinstance(result, list):
+     items = result
+ elif isinstance(result, dict):
+     data = result.get("data", result)
+     if isinstance(data, dict):
+         items = data.get("items", [])
+     elif isinstance(data, list):
+         items = data
+     else:
+         items = []
+ else:
+     items = []
```


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agents/apify_dubai_scraper.py`:
- Around line 7-12: The script documents that APIFY_TOKEN can live in a .env but
only reads os.environ; fix by loading .env at module start (before any env
access) using python-dotenv: add "from dotenv import load_dotenv" and call
"load_dotenv()" near the top of the file before retrieving APIFY_TOKEN (and
similarly before the env checks around the block referenced at lines ~175-180).
Alternatively, update the CLI docs to remove the .env claim — but the preferred
fix is to call load_dotenv() before accessing APIFY_TOKEN so os.environ sees
variables from .env.
- Around line 148-158: Concurrent runs overwrite the shared JSON leads store
because load_existing_leads and save_leads read the whole file, mutate
in-memory, and rewrite it; change these functions to perform concurrency-safe
updates (e.g., acquire a file lock around read-modify-write or switch to an
append-only/JSONL writer) so overlapping scrapers don't lose data. Specifically,
update load_existing_leads, save_leads and any callers that append to LEADS_FILE
so they obtain an exclusive lock on LEADS_FILE (or open it in append mode for
JSONL) before reading/writing, merge new leads into the existing set safely, and
use atomic replace/rename when writing to avoid partial writes; ensure the same
locking/format is used by agents/dubai_villa_scraper.py to keep behavior
consistent.
- Around line 107-111: The fetch_results function assumes apify_request returned
a dict and calls .get on it, but the /actor-runs/{run_id}/dataset/items endpoint
returns a raw list; update fetch_results to handle both shapes by checking the
type of the response from apify_request (called in fetch_results) and set items
= result if it's a list, otherwise fall back to result.get("data",
{}).get("items", []); keep the log(f"Fetched {len(items)} items from Apify") and
ensure the function returns the items list.

In `@agents/dubai_villa_scraper.py`:
- Around line 49-59: The read-modify-write on LEADS_FILE in
load_existing_leads/save_leads causes lost-update races; wrap the entire
sequence that reads, merges new leads, and writes back with a file-level lock
(e.g., using a FileLock/portalocker) so concurrent agents serialize access to
LEADS_FILE, perform a read->merge-by-unique-id->write under the same lock, and
write atomically (write to temp + os.replace) to avoid corruption; apply the
same locking/merge pattern to the other lead-write logic referenced around the
second block (lines 302-319) and reference the functions load_existing_leads and
save_leads when locating the code to change.
- Around line 219-237: The scraper only fetches the first Bayut search page so
it under-collects when max_results is larger; modify the listing collection
around encoded_area/url/fetch_url/listing_urls to paginate (add a page
parameter, e.g. append "&page={page}" to the existing query), loop fetching
pages starting at page=1, extract and accumulate unique listing URLs into
listing_urls until you have >= max_results or no new URLs are returned, and then
slice to max_results before iterating to create leads; ensure you break on empty
pages and update logs to show pages fetched.

In `@core/leads-bridge.js`:
- Around line 43-59: The code currently appends all unsynced leads to persistent
storage regardless of whether sheetsService.addLead succeeded; change the logic
to only persist leads that were actually synced by collecting successful leads
(e.g., push to a local array inside the try block or incrementally build
syncedLeads) and then call saveSynced([...loadSynced(), ...syncedLeads]) instead
of saveSynced([...loadSynced(), ...unsynced]); keep the existing try/catch
around sheetsService.addLead and error logging for failed attempts so transient
failures remain retryable.
- Around line 29-31: The syncLeadsToSheets function instantiates the Sheets
module with new (require("./sheets"))() which is incorrect because
core/sheets.js exports a singleton; change the code to require/import the
exported sheets instance (the exported symbol from "./sheets") and use that
instance (e.g., sheets.init()) instead of creating a new object; if you actually
need the class, update core/sheets.js to also export the class constructor and
require that specific export, but by default reuse the exported sheets singleton
rather than calling new on require("./sheets").

In `@README.md`:
- Around line 57-70: Update the README to explicitly require Python 3.10+ for
running the scraper commands: add a short note above the examples stating
"Requires Python 3.10+" (or "Python 3.10 or later") because the agents
(agents/dubai_villa_scraper.py and agents/apify_dubai_scraper.py) use PEP 604
union syntax (e.g., str | None) and newer parameterized built-ins; also keep the
APIFY_TOKEN note for the Apify scraper so users know environment setup
requirements.
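Two of the prompts above call for concurrency-safe, atomic updates to the shared villa_leads.json store. A minimal stdlib-only sketch of the read-merge-write pattern (a hypothetical helper, not the PR's implementation) looks like this; a cross-process lock such as the `filelock` package or `fcntl.flock` would still be needed around the whole function to fully serialize concurrent scrapers:

```python
import json
import os
import tempfile

def merge_and_save_leads(path: str, new_leads: list) -> int:
    """Read-merge-write the leads file keyed by phone, with an atomic replace."""
    try:
        with open(path, encoding="utf-8") as f:
            existing = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        existing = []
    by_phone = {lead["phone"]: lead for lead in existing}
    for lead in new_leads:
        by_phone.setdefault(lead["phone"], lead)  # first occurrence wins
    merged = list(by_phone.values())
    # Write to a temp file in the same directory, then atomically swap it in,
    # so a crash mid-write never leaves a truncated leads file behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    os.replace(tmp, path)
    return len(merged)
```

`os.replace` is atomic when source and destination live on the same filesystem, which is why the temp file is created next to the target rather than in the system temp directory.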

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aee2bf30-f927-4cf6-aaf6-4e1e0c9efd24

📥 Commits

Reviewing files that changed from the base of the PR and between f0349c5 and 883f1f8.

📒 Files selected for processing (5)
  • README.md
  • agents/apify_dubai_scraper.py
  • agents/dubai_villa_scraper.py
  • agents/requirements.txt
  • core/leads-bridge.js

Comment on lines +7 to +12
```text
This is the FAST path - uses Apify's actor which handles anti-bot measures.
Requires APIFY_TOKEN in .env

Usage:
    python agents/apify_dubai_scraper.py
    python agents/apify_dubai_scraper.py --area "Palm Jumeirah" --max 100
```


⚠️ Potential issue | 🟠 Major

Load .env here or change the documented CLI contract.

The docstring says APIFY_TOKEN can live in .env, but this entry point only checks os.environ. Running python agents/apify_dubai_scraper.py as documented will still hit APIFY_TOKEN not set unless the caller exported the variable first.

Also applies to: 175-180

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/apify_dubai_scraper.py` around lines 7 - 12, The script documents that
APIFY_TOKEN can live in a .env but only reads os.environ; fix by loading .env at
module start (before any env access) using python-dotenv: add "from dotenv
import load_dotenv" and call "load_dotenv()" near the top of the file before
retrieving APIFY_TOKEN (and similarly before the env checks around the block
referenced at lines ~175-180). Alternatively, update the CLI docs to remove the
.env claim — but the preferred fix is to call load_dotenv() before accessing
APIFY_TOKEN so os.environ sees variables from .env.
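The preferred fix can be sketched as a small helper, assuming the third-party `python-dotenv` package is installed (the helper name is illustrative, not from the PR):

```python
import os
from typing import Optional

def get_apify_token() -> Optional[str]:
    """Load .env (when python-dotenv is available) before reading APIFY_TOKEN."""
    try:
        from dotenv import load_dotenv  # third-party: pip install python-dotenv
        load_dotenv()  # merges .env entries into os.environ without overwriting
    except ImportError:
        pass  # no python-dotenv installed: fall back to exported variables only
    return os.environ.get("APIFY_TOKEN")
```

Calling this once at module start, before any other environment access, makes the documented `python agents/apify_dubai_scraper.py` invocation work with only a `.env` file present.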

Comment on lines +107 to +111
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
    log(f"Fetched {len(items)} items from Apify")
```


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```shell
# First, find and inspect the apify_dubai_scraper.py file
find . -name "apify_dubai_scraper.py" -type f
```

Repository: vishnu-madhavan-git/automation

Length of output: 105


🌐 Web query:

Apify API dataset-items endpoint response format /actor-runs/{runId}/dataset/items

💡 Result:

GET /v2/actor-runs/{runId}/dataset/items is the Actor-run “default dataset” items endpoint and it behaves the same as the regular Get dataset items endpoint (same query params + same output formats). (docs.apify.com)

Default response (no format param)

  • format=json (default): returns a JSON array of item objects, e.g. [{...},{...}]. (docs.apify.com)

Other response formats (format=...)

  • format=jsonl: newline-delimited JSON (one JSON object per line, ending with \n). (docs.apify.com)
  • format=csv: CSV text (header\nrow\n...), with optional UTF‑8 BOM behavior controlled by bom=0|1. (docs.apify.com)
  • Also supported: html, xlsx, xml, rss (returned as the corresponding textual/binary representation). (docs.apify.com)

Pagination metadata (via response headers)

The items response includes these headers:

  • X-Apify-Pagination-Offset
  • X-Apify-Pagination-Limit
  • X-Apify-Pagination-Count
  • X-Apify-Pagination-Total (docs.apify.com)

Common query params that affect the output shape

  • offset, limit
  • fields (pick only listed fields), omit
  • unwind, flatten
  • clean (shortcut for skipHidden=true + skipEmpty=true) (docs.apify.com)



🏁 Script executed:

```shell
# Search for the apify_request function to understand response handling
rg "def apify_request" -A 10 -t py
```

Repository: vishnu-madhavan-git/automation

Length of output: 978


🏁 Script executed:

```shell
# Inspect the actual fetch_results function in the file to confirm code and line numbers
cat -n agents/apify_dubai_scraper.py | sed -n '100,120p'
```

Repository: vishnu-madhavan-git/automation

Length of output: 936


🏁 Script executed:

```shell
# Also check the full context around the function to understand usage
cat -n agents/apify_dubai_scraper.py | sed -n '1,30p'
```

Repository: vishnu-madhavan-git/automation

Length of output: 1126


🏁 Script executed:

```shell
# Check the complete apify_request function to see full implementation
cat -n agents/apify_dubai_scraper.py | sed -n '40,60p'
```

Repository: vishnu-madhavan-git/automation

Length of output: 1041


🏁 Script executed:

```shell
# Verify if there are any other usages of apify_request to understand the response pattern
rg "apify_request" agents/apify_dubai_scraper.py -B 1 -A 1
```

Repository: vishnu-madhavan-git/automation

Length of output: 653


Treat the dataset-items response as a raw list.

Apify's /actor-runs/{runId}/dataset/items endpoint returns a JSON array directly (e.g. [{...},{...}]), not a wrapped object. The current code calls .get("data", {}) on the response, which will fail with AttributeError: 'list' object has no attribute 'get'. This differs from other Apify endpoints in this file (e.g. /actor-runs/{run_id}) which return wrapped responses.

Proposed fix

```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
-    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
+    if not isinstance(items, list):
+        log(f"Unexpected dataset response: {items}")
+        return []
     log(f"Fetched {len(items)} items from Apify")
     return items
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    if not isinstance(items, list):
        log(f"Unexpected dataset response: {items}")
        return []
    log(f"Fetched {len(items)} items from Apify")
    return items
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/apify_dubai_scraper.py` around lines 107 - 111, The fetch_results
function assumes apify_request returned a dict and calls .get on it, but the
/actor-runs/{run_id}/dataset/items endpoint returns a raw list; update
fetch_results to handle both shapes by checking the type of the response from
apify_request (called in fetch_results) and set items = result if it's a list,
otherwise fall back to result.get("data", {}).get("items", []); keep the
log(f"Fetched {len(items)} items from Apify") and ensure the function returns
the items list.

Comment on lines +148 to +158

```python
def load_existing_leads() -> list:
    if LEADS_FILE.exists():
        try:
            return json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except Exception:
            return []
    return []


def save_leads(leads: list) -> None:
    LEADS_FILE.write_text(json.dumps(leads, indent=2, ensure_ascii=False), encoding="utf-8")
```

⚠️ Potential issue | 🟠 Major

Serialize writes to the shared leads store.

This code loads data/state/villa_leads.json, appends in memory, and rewrites the whole file. agents/dubai_villa_scraper.py does the same against the same path, so when runs overlap, the later write silently discards the leads saved by whichever scraper finished first.

Also applies to: 198-201

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 152-152: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/apify_dubai_scraper.py` around lines 148 - 158, Concurrent runs
overwrite the shared JSON leads store because load_existing_leads and save_leads
read the whole file, mutate in-memory, and rewrite it; change these functions to
perform concurrency-safe updates (e.g., acquire a file lock around
read-modify-write or switch to an append-only/JSONL writer) so overlapping
scrapers don't lose data. Specifically, update load_existing_leads, save_leads
and any callers that append to LEADS_FILE so they obtain an exclusive lock on
LEADS_FILE (or open it in append mode for JSONL) before reading/writing, merge
new leads into the existing set safely, and use atomic replace/rename when
writing to avoid partial writes; ensure the same locking/format is used by
agents/dubai_villa_scraper.py to keep behavior consistent.
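A stdlib-only sketch of the locked read-merge-write the prompt describes (the helper name, the phone-number dedup key, and the lock-file path are illustrative; `fcntl.flock` is POSIX-only, so Windows would need a different lock):

```python
import fcntl
import json
import os
from pathlib import Path

LEADS_FILE = Path("data/state/villa_leads.json")
LOCK_FILE = LEADS_FILE.with_suffix(".lock")


def merge_leads(new_leads: list[dict]) -> list[dict]:
    """Read, merge, and rewrite the leads store under an exclusive lock."""
    LEADS_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(LOCK_FILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize concurrent scrapers
        try:
            existing = json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            existing = []
        # Deduplicate by phone number; the existing entry wins.
        by_phone = {lead.get("phone"): lead for lead in existing}
        for lead in new_leads:
            by_phone.setdefault(lead.get("phone"), lead)
        merged = list(by_phone.values())
        # Write to a temp file and atomically replace to avoid partial writes.
        tmp = LEADS_FILE.with_suffix(".tmp")
        tmp.write_text(json.dumps(merged, indent=2, ensure_ascii=False),
                       encoding="utf-8")
        os.replace(tmp, LEADS_FILE)
        return merged
```

Both agents would call the same helper so they share one lock and one merge policy.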

Comment on lines +49 to +59

```python
def load_existing_leads() -> list:
    if LEADS_FILE.exists():
        try:
            return json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except Exception:
            return []
    return []


def save_leads(leads: list) -> None:
    LEADS_FILE.write_text(json.dumps(leads, indent=2, ensure_ascii=False), encoding="utf-8")
```

⚠️ Potential issue | 🟠 Major

The shared leads store still has a lost-update race.

Like the Apify agent, this flow rewrites data/state/villa_leads.json via a read-modify-write sequence with no locking. If both agents overlap, the later write drops the earlier agent’s newly found leads.

Also applies to: 302-319

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 53-53: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/dubai_villa_scraper.py` around lines 49 - 59, The read-modify-write on
LEADS_FILE in load_existing_leads/save_leads causes lost-update races; wrap the
entire sequence that reads, merges new leads, and writes back with a file-level
lock (e.g., using a FileLock/portalocker) so concurrent agents serialize access
to LEADS_FILE, perform a read->merge-by-unique-id->write under the same lock,
and write atomically (write to temp + os.replace) to avoid corruption; apply the
same locking/merge pattern to the other lead-write logic referenced around the
second block (lines 302-319) and reference the functions load_existing_leads and
save_leads when locating the code to change.

Comment on lines +219 to +237

```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"

log(f"Bayut: starting scrape (area={area or 'Dubai'}, max={max_results})")
html = fetch_url(url)

if not html:
    log("Bayut: failed to fetch listings page")
    return leads

# Extract listing links
listing_urls = re.findall(r'"(https://www\.bayut\.com/property/[^"]+)"', html)
if not listing_urls:
    listing_urls = re.findall(r'href="(/property/[^"]+)"', html)
    listing_urls = [f"https://www.bayut.com{u}" for u in listing_urls]

log(f"Bayut: found {len(listing_urls)} listing URLs")

for listing_url in listing_urls[:max_results]:
```

⚠️ Potential issue | 🟠 Major

Paginate Bayut results before honoring --max.

This implementation requests exactly one Bayut search page and never advances a page parameter. Any run that needs more leads than that first page exposes will silently under-collect, even though --max suggests otherwise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/dubai_villa_scraper.py` around lines 219 - 237, The scraper only
fetches the first Bayut search page so it under-collects when max_results is
larger; modify the listing collection around
encoded_area/url/fetch_url/listing_urls to paginate (add a page parameter, e.g.
append "&page={page}" to the existing query), loop fetching pages starting at
page=1, extract and accumulate unique listing URLs into listing_urls until you
have >= max_results or no new URLs are returned, and then slice to max_results
before iterating to create leads; ensure you break on empty pages and update
logs to show pages fetched.
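A pagination loop along the lines the prompt describes might look like this (the `&page={n}` query parameter and the stop conditions are assumptions about Bayut's URL scheme, not verified against the live site; `fetch_url` is the scraper's existing helper, injected here for testability):

```python
import re


def collect_listing_urls(fetch_url, base_url: str, max_results: int) -> list[str]:
    """Accumulate unique listing URLs across pages until max_results or a dry page."""
    seen: list[str] = []
    page = 1
    while len(seen) < max_results:
        html = fetch_url(f"{base_url}&page={page}" if page > 1 else base_url)
        if not html:
            break
        found = re.findall(r'"(https://www\.bayut\.com/property/[^"]+)"', html)
        new = [u for u in found if u not in seen]
        if not new:  # empty or repeated page: stop paginating
            break
        seen.extend(new)
        page += 1
    return seen[:max_results]
```

The caller would then iterate the returned list instead of slicing a single page's results.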

Comment thread core/leads-bridge.js
Comment on lines +29 to +31

```js
async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
  const initialized = await sheetsService.init();
```

⚠️ Potential issue | 🔴 Critical

Use the exported Sheets singleton instead of constructing it.

core/sheets.js already exports a service instance. Calling new (require("./sheets"))() here will throw before any sync happens; reuse the imported sheets object or export the class separately.

Proposed fix

```diff
 async function syncLeadsToSheets() {
-  const sheetsService = new (require("./sheets"))();
+  const sheetsService = sheets;
   const initialized = await sheetsService.init();
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/leads-bridge.js` around lines 29 - 31, The syncLeadsToSheets function
instantiates the Sheets module with new (require("./sheets"))() which is
incorrect because core/sheets.js exports a singleton; change the code to
require/import the exported sheets instance (the exported symbol from
"./sheets") and use that instance (e.g., sheets.init()) instead of creating a
new object; if you actually need the class, update core/sheets.js to also export
the class constructor and require that specific export, but by default reuse the
exported sheets singleton rather than calling new on require("./sheets").

Comment thread core/leads-bridge.js
Comment on lines +43 to +59

```js
  let synced = 0;
  for (const lead of unsynced) {
    try {
      await sheetsService.addLead({
        name: lead.name,
        phone: lead.phone,
        notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
      });
      synced++;
    } catch (err) {
      console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
    }
  }

  // Mark all as synced
  const newSynced = [...loadSynced(), ...unsynced];
  saveSynced(newSynced);
```

⚠️ Potential issue | 🔴 Critical

Only persist leads that actually synced.

saveSynced() currently appends every unsynced lead, even when addLead() threw. A transient Sheets failure will therefore permanently suppress retries for that lead.

Proposed fix

```diff
   let synced = 0;
+  const syncedNow = [];
   for (const lead of unsynced) {
     try {
       await sheetsService.addLead({
         name: lead.name,
         phone: lead.phone,
         notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
       });
       synced++;
+      syncedNow.push(lead);
     } catch (err) {
       console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
     }
   }

   // Mark all as synced
-  const newSynced = [...loadSynced(), ...unsynced];
+  const newSynced = [...loadSynced(), ...syncedNow];
   saveSynced(newSynced);
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```js
  let synced = 0;
  const syncedNow = [];
  for (const lead of unsynced) {
    try {
      await sheetsService.addLead({
        name: lead.name,
        phone: lead.phone,
        notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
      });
      synced++;
      syncedNow.push(lead);
    } catch (err) {
      console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
    }
  }

  // Mark all as synced
  const newSynced = [...loadSynced(), ...syncedNow];
  saveSynced(newSynced);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/leads-bridge.js` around lines 43 - 59, The code currently appends all
unsynced leads to persistent storage regardless of whether sheetsService.addLead
succeeded; change the logic to only persist leads that were actually synced by
collecting successful leads (e.g., push to a local array inside the try block or
incrementally build syncedLeads) and then call saveSynced([...loadSynced(),
...syncedLeads]) instead of saveSynced([...loadSynced(), ...unsynced]); keep the
existing try/catch around sheetsService.addLead and error logging for failed
attempts so transient failures remain retryable.

Comment thread README.md
Comment on lines +57 to +70
## Dubai Villa Lead Scraper

Two scraper agents for collecting direct villa owner contacts in Dubai:

### Basic Scraper (no API key needed)
```bash
python agents/dubai_villa_scraper.py --source both --area "Palm Jumeirah" --max 30
```

### Apify Scraper (faster, more reliable)
Requires `APIFY_TOKEN` in `.env`
```bash
python agents/apify_dubai_scraper.py --area "Emirates Hills" --max 100
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
# First, let's find the agents directory and list its contents
find . -name "agents" -type d | head -20
git ls-files | grep -E "agents/" | head -20
```


🏁 Script executed:

```bash
# Check if the agent files exist
ls -la agents/ 2>/dev/null || echo "agents/ directory not found at root"
```


🏁 Script executed:

```bash
# Search for these specific files
find . -name "dubai_villa_scraper.py" -o -name "apify_dubai_scraper.py"
```


🏁 Script executed:

```bash
# Check for version requirements in common config files
cat pyproject.toml 2>/dev/null | grep -i python || echo "No python requirement in pyproject.toml"
cat setup.py 2>/dev/null | grep -i python || echo "No setup.py found"
cat setup.cfg 2>/dev/null | grep -i python || echo "No setup.cfg found"
```


🏁 Script executed:

```bash
# Check the agent files for parameterized built-ins and union syntax
wc -l agents/dubai_villa_scraper.py agents/apify_dubai_scraper.py
```


🏁 Script executed:

```bash
# Search for parameterized built-ins (list[...], dict[...], etc.) and union syntax (X | Y)
rg "list\[|dict\[|tuple\[|set\[|\s\|\s" agents/
```


🏁 Script executed:

```bash
# Check the first 50 lines of each agent file for imports and type hints
head -50 agents/dubai_villa_scraper.py
```


🏁 Script executed:

```bash
# Check the apify agent file as well
head -50 agents/apify_dubai_scraper.py
```


Specify Python 3.10+ requirement for these commands.

The agents use PEP 604 union syntax (str | None) and parameterized built-ins (list[str], dict, etc.), which require Python 3.10+ and 3.9+ respectively. The README shows unqualified python commands, so users with system Python < 3.10 will fail at runtime. Add a note requiring Python 3.10+.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 57 - 70, Update the README to explicitly require
Python 3.10+ for running the scraper commands: add a short note above the
examples stating "Requires Python 3.10+" (or "Python 3.10 or later") because the
agents (agents/dubai_villa_scraper.py and agents/apify_dubai_scraper.py) use PEP
604 union syntax (e.g., str | None) and newer parameterized built-ins; also keep
the APIFY_TOKEN note for the Apify scraper so users know environment setup
requirements.
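Beyond the README note, a guard like the following could fail fast at startup with a clear message (the function name and wording are illustrative, not part of the PR):

```python
import sys


def require_python(minimum: tuple = (3, 10), version=None) -> None:
    """Exit early instead of failing with a late SyntaxError on 'str | None' hints."""
    version = version if version is not None else tuple(sys.version_info)
    if version < minimum:
        raise SystemExit(
            "Python %d.%d+ required: these agents use PEP 604 unions "
            "like 'str | None'" % (minimum[0], minimum[1])
        )
```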
